hardware resource
Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases
Boullé, Marc, Voisine, Nicolas, Guerraz, Bruno, Hue, Carine, Olmos, Felipe, Popescu, Vladimir, Gouache, Stéphane, Bouget, Stéphane, Bondu, Alexis, Gauthier, Luc Aurelien, Benrekia, Yassine Nair, Clérot, Fabrice, Lemaire, Vincent
Khiops is an open-source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available in many environments, both as a Python library and via a user interface.
RAG-Stack: Co-Optimizing RAG Quality and Performance From the Vector Database Perspective
Retrieval-augmented generation (RAG) has emerged as one of the most prominent applications of vector databases. By integrating documents retrieved from a database into the prompt of a large language model (LLM), RAG enables more reliable and informative content generation. While there has been extensive research on vector databases, many open research problems remain once they are considered in the wider context of end-to-end RAG pipelines. One practical yet challenging problem is how to jointly optimize both system performance and generation quality in RAG, which is significantly more complex than it appears due to the numerous knobs on both the algorithmic side (spanning models and databases) and the systems side (from software to hardware). In this paper, we present RAG-Stack, a three-pillar blueprint for quality-performance co-optimization in RAG systems. RAG-Stack comprises: (1) RAG-IR, an intermediate representation that serves as an abstraction layer to decouple quality and performance aspects; (2) RAG-CM, a cost model for estimating system performance given an RAG-IR; and (3) RAG-PE, a plan exploration algorithm that searches for high-quality, high-performance RAG configurations. We believe this three-pillar blueprint will become the de facto paradigm for RAG quality-performance co-optimization in the years to come.
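The idea of searching for configurations that are both high-quality and high-performance can be illustrated with a Pareto filter. The sketch below is not RAG-PE itself: the plan names, quality scores, and latencies are made-up values, and `pareto_front` is a hypothetical helper that simply keeps every plan not dominated on both axes.

```python
def pareto_front(plans):
    """Return the names of plans that no other plan dominates.

    Each plan is (name, quality, latency_ms). Higher quality is better,
    lower latency is better. All values used below are illustrative only.
    """
    front = []
    for name, quality, latency in plans:
        dominated = any(
            other_q >= quality
            and other_l <= latency
            and (other_q > quality or other_l < latency)
            for _, other_q, other_l in plans
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical candidate configurations (numbers are assumptions).
candidate_plans = [
    ("small-k", 0.70, 50),
    ("large-k", 0.80, 120),
    ("rerank", 0.78, 150),
    ("large-model", 0.90, 400),
]

frontier = pareto_front(candidate_plans)
print(frontier)
```

In a real pipeline the latency column would come from a cost model rather than a hard-coded table, and the quality column from an evaluation run.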
PLoRA: Efficient LoRA Hyperparameter Tuning for Large Models
Yan, Minghao, Wang, Zhuang, Jia, Zhen, Venkataraman, Shivaram, Wang, Yida
Low-rank Adaptation (LoRA) has gained popularity as a fine-tuning approach for Large Language Models (LLMs) due to its low resource requirements and good performance. While a plethora of work has investigated improving LoRA serving efficiency by serving multiple LoRAs concurrently, existing methods assume that a wide range of LoRA adapters are available for serving. In our work, we conduct extensive empirical studies and find that current training paradigms do not utilize hardware resources efficiently and incur high overhead to obtain a performant LoRA. Leveraging these insights, we propose PLoRA, which automatically orchestrates concurrent LoRA fine-tuning jobs under given hardware and model constraints and develops performant kernels to improve training efficiency. Our experimental studies show that PLoRA reduces the makespan of LoRA fine-tuning over a given hyperparameter search space by up to 7.52x and improves training throughput by up to 12.8x across a range of state-of-the-art LLMs.
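One way to picture running several LoRA fine-tuning jobs concurrently under a hardware constraint is as a bin-packing problem. The sketch below is not PLoRA's scheduler: it uses a plain first-fit-decreasing heuristic, and the memory model (a shared base model plus a fixed per-adapter footprint), the job names, and all sizes are assumptions made for illustration.

```python
def pack_jobs(jobs, budget_gb, base_model_gb):
    """Group adapter jobs so each group fits on one device.

    jobs: list of (name, adapter_memory_gb) pairs (hypothetical sizes).
    The base model is assumed to be loaded once per group and shared.
    Uses first-fit decreasing, a generic heuristic, not the paper's method.
    """
    capacity = budget_gb - base_model_gb
    if capacity <= 0:
        raise ValueError("base model does not fit in the memory budget")

    groups = []  # each group: {"jobs": [names], "used": memory in GB}
    for name, mem in sorted(jobs, key=lambda job: job[1], reverse=True):
        if mem > capacity:
            raise ValueError(f"job {name} does not fit next to the base model")
        for group in groups:
            if group["used"] + mem <= capacity:
                group["jobs"].append(name)
                group["used"] += mem
                break
        else:
            # No existing group has room, so open a new one.
            groups.append({"jobs": [name], "used": mem})
    return [group["jobs"] for group in groups]


# Hypothetical search space: one job per LoRA rank, sizes in GB.
search_space = [("r8", 2), ("r16", 4), ("r32", 8), ("r64", 16)]
schedule = pack_jobs(search_space, budget_gb=40, base_model_gb=20)
print(schedule)
```

Each inner list is a set of jobs that would train side by side on one device; fewer groups means fewer sequential rounds and a shorter makespan.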
AI-Driven Optimization of Hardware Overlay Configurations
Designing and optimizing FPGA overlays is a complex and time-consuming process, often requiring multiple trial-and-error iterations to determine a suitable configuration. This paper presents an AI-driven approach to optimizing FPGA overlay configurations, specifically focusing on the NAPOLY+ automata processor implemented on the ZCU104 FPGA. By leveraging machine learning techniques, particularly Random Forest regression, we predict the feasibility and efficiency of different configurations before hardware compilation. Our method significantly reduces the number of required iterations by estimating resource utilization, including logical elements, distributed memory, and fanout, based on historical design data. Experimental results demonstrate that our model achieves high prediction accuracy, closely matching actual resource usage while accelerating the design process.
Exploring the Limitations of Kolmogorov-Arnold Networks in Classification: Insights to Software Training and Hardware Implementation
Tran, Van Duy, Le, Tran Xuan Hieu, Tran, Thi Diem, Pham, Hoai Luan, Le, Vu Trung Duong, Vu, Tuan Hai, Nguyen, Van Tinh, Nakashima, Yasuhiko
Kolmogorov-Arnold Networks (KANs), a novel type of neural network, have recently gained popularity and attention due to their potential to replace multi-layer perceptrons (MLPs) in artificial intelligence (AI) with higher accuracy and interpretability. However, assessments of KANs remain limited and do not provide an in-depth analysis of a specific domain. Furthermore, no study has been conducted on the implementation of KANs in hardware design, which would directly demonstrate whether KANs are truly superior to MLPs in practical applications. In this paper, we therefore focus on evaluating KANs on classification problems, a common but significant topic in AI, using four different types of datasets. Furthermore, the corresponding hardware implementation is considered using the Vitis high-level synthesis (HLS) tool. To the best of our knowledge, this is the first article to implement hardware for KAN. The results indicate that KANs cannot achieve higher accuracy than MLPs on highly complex datasets while utilizing substantially more hardware resources. Therefore, MLP remains an effective approach for achieving accuracy and efficiency in software and hardware implementation.
COSTREAM: Learned Cost Models for Operator Placement in Edge-Cloud Environments
Heinrich, Roman, Binnig, Carsten, Kornmayer, Harald, Luthra, Manisha
In this work, we present COSTREAM, a novel learned cost model for Distributed Stream Processing Systems that provides accurate predictions of the execution costs of a streaming query in an edge-cloud environment. The cost model can be used to find an initial placement of operators across heterogeneous hardware, which is particularly important in these environments. In our evaluation, we demonstrate that COSTREAM can produce highly accurate cost estimates for the initial operator placement and even generalize to unseen placements, queries, and hardware. When using COSTREAM to optimize the placements of streaming operators, a median speed-up of around 21x can be achieved compared to baselines.
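How a cost model can drive initial operator placement is sketched below. This is not COSTREAM: the learned model is replaced by a hand-written toy function, and the operator names, node speeds, and hop penalty are all assumed values. The search simply enumerates every placement and keeps the one with the lowest predicted cost.

```python
from itertools import product


def best_placement(operators, nodes, predict_cost):
    """Exhaustively search placements and return (placement, cost).

    predict_cost stands in for a learned cost model; any callable that
    maps a {operator: node} dict to a number can be plugged in.
    """
    best, best_cost = None, float("inf")
    for assignment in product(nodes, repeat=len(operators)):
        placement = dict(zip(operators, assignment))
        cost = predict_cost(placement)
        if cost < best_cost:
            best, best_cost = placement, cost
    return best, best_cost


# Toy stand-in for the cost model (all constants are assumptions).
NODE_SPEED = {"edge": 1.0, "cloud": 4.0}
OP_WORK = {"source": 1.0, "filter": 2.0, "aggregate": 8.0}
HOP_PENALTY = 1.5


def toy_cost(placement):
    # The data source is assumed to be physically attached to the edge.
    if placement["source"] != "edge":
        return float("inf")
    ops = list(placement)
    compute = sum(OP_WORK[op] / NODE_SPEED[placement[op]] for op in ops)
    hops = sum(1 for a, b in zip(ops, ops[1:]) if placement[a] != placement[b])
    return compute + HOP_PENALTY * hops


placement, cost = best_placement(
    ["source", "filter", "aggregate"], ["edge", "cloud"], toy_cost
)
print(placement, cost)
```

Exhaustive enumeration grows exponentially with the number of operators, so it only serves to show where a cost model plugs in; a practical optimizer would prune or sample the search space.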
Efficient and Mathematically Robust Operations for Certified Neural Networks Inference
Geyer, Fabien, Freitag, Johannes, Schulz, Tobias, Uhrig, Sascha
In recent years, machine learning (ML) and neural networks (NNs) have gained widespread use and attention across various domains, particularly in transportation for achieving autonomy, including the emergence of flying taxis for urban air mobility (UAM). However, concerns about certification have come up, compelling the development of standardized processes encompassing the entire ML and NN pipeline. This paper delves into the inference stage and the requisite hardware, highlighting the challenges associated with IEEE 754 floating-point arithmetic and proposing alternative number representations. By evaluating diverse summation and dot product algorithms, we aim to mitigate issues related to non-associativity. Additionally, our exploration of fixed-point arithmetic reveals its advantages over floating-point methods, demonstrating significant hardware efficiencies. Employing an empirical approach, we ascertain the optimal bit-width necessary to attain an acceptable level of accuracy, considering the inherent complexity of bit-width optimization.
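The non-associativity issue mentioned above can be reproduced in a few lines. The sketch below compares a naive left-to-right sum, Kahan compensated summation, and an integer fixed-point dot product. The choice of Kahan summation, the 16 fractional bits, and the input values are assumptions for illustration and are not taken from the paper's evaluation.

```python
import math


def naive_sum(values):
    """Left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation: carries the rounding error of each add."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total


def to_fixed(x, frac_bits):
    """Quantize a real number to an integer with frac_bits fractional bits."""
    return round(x * (1 << frac_bits))


def fixed_dot(xs, ws, frac_bits=16):
    """Integer dot product; the result carries 2 * frac_bits fractional bits."""
    return sum(to_fixed(x, frac_bits) * to_fixed(w, frac_bits)
               for x, w in zip(xs, ws))


# One large value followed by many tiny ones: the order of addition matters.
values = [1.0] + [1e-16] * 10
forward = naive_sum(values)
backward = naive_sum(list(reversed(values)))
reference = math.fsum(values)  # correctly rounded sum from the stdlib

print(forward == backward)  # False: float addition is not associative
print(abs(kahan_sum(values) - reference) < abs(forward - reference))

# Integer accumulation gives the same result in any order.
xs, ws = [0.1, 0.2, 0.3], [0.5, 0.25, 0.125]
print(fixed_dot(xs, ws) == fixed_dot(xs[::-1], ws[::-1]))
```

The fixed-point version trades rounding-order sensitivity for quantization error, which is why the bit-width has to be chosen against an accuracy target.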
LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics
Que, Zhiqiang, Fan, Hongxiang, Loo, Marcus, Li, He, Blott, Michaela, Pierini, Maurizio, Tapper, Alexander, Luk, Wayne
This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors, delivering unprecedented low latency performance. Incorporating FPGA-based GNNs into particle detectors presents a unique challenge since it requires sub-microsecond latency to deploy the networks for online event selection with a data rate of hundreds of terabytes per second in the Level-1 triggers at the CERN Large Hadron Collider experiments. This paper proposes a novel outer-product based matrix multiplication approach, which is enhanced by exploiting the structured adjacency matrix and a column-major data layout. Moreover, a fusion step is introduced to further reduce the end-to-end design latency by eliminating unnecessary boundaries. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented which not only finds a design with a much better latency but also finds a high accuracy design under given latency constraints. To facilitate this, a customizable template for this low latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 9.0 times faster and achieves up to 13.1 times higher power efficiency than a GPU implementation. Compared to the previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy. The proposed LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
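The outer-product formulation of matrix multiplication can be written down generically as follows. This sketch only shows the arithmetic: it is not the LL-GNN hardware dataflow, and the zero-skipping branch is an assumed illustration of how a sparse, structured adjacency matrix could reduce work, not a description of the paper's optimization.

```python
def matmul_outer(a, b):
    """Compute C = A @ B as a sum of outer products.

    For each shared index p, column p of A and row p of B form a rank-1
    update of C. Reading A column by column is why a column-major layout
    suits this formulation.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for p in range(k):
        column = [a[i][p] for i in range(n)]
        row = b[p]
        for i in range(n):
            if column[i] == 0:
                # Assumed shortcut: zero entries (for example missing edges
                # in an adjacency matrix) contribute nothing to the update.
                continue
            for j in range(m):
                c[i][j] += column[i] * row[j]
    return c


def matmul_inner(a, b):
    """Reference inner-product (row times column) formulation."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


adjacency = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # small example graph
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
aggregated = matmul_outer(adjacency, features)
print(aggregated)
```

Because every rank-1 update touches independent entries of C, the two inner loops can be fully unrolled in hardware, which is one reason this formulation is attractive for low-latency designs.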